Scatterplots

PH345: Winter 2025

Phil Boonstra

Edward R Tufte

American political scientist, statistician, and professor emeritus at Yale University

‘Godfather’ of data visualization and visual presentation of information

Author of The Visual Display of Quantitative Information (2001)

Photo by Keegan Peterzell - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=40367115

Eight principles of graphical excellence

  • Show the data
  • Reveal the data at several levels of detail
  • Encourage comparison between data
  • Induce the viewer to think about the substance
  • Avoid distorting what the data have to say
  • Present many numbers in small space
  • Make large data sets coherent
  • Clear purpose: description, exploration, tabulation or decoration

Three datasets:

Statistic  Dataset 1  Dataset 2  Dataset 3
n          142        142        142
mean.x     54.3       54.3       54.3
sd.x       16.8       16.8       16.8
mean.y     47.8       47.8       47.8
sd.y       26.9       26.9       26.9
cor.xy     -0.07      -0.06      -0.07

Graphical excellence

Show the data

All datasets have nearly equal summary statistics:

  • number of observations
  • mean of x and y
  • standard deviation of x and y
  • correlation between x and y (and nearly the same regression of y on x)

Dataset     Intercept  Slope
away        53.43      -0.10
bullseye    53.81      -0.11
circle      53.80      -0.11
dino        53.45      -0.10
dots        53.10      -0.10
h_lines     53.21      -0.10
high_lines  53.81      -0.11
slant_down  53.85      -0.11
slant_up    53.81      -0.11
star        53.33      -0.10
v_lines     53.89      -0.11
wide_lines  53.63      -0.11
x_shape     53.55      -0.11
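
A minimal sketch of how the summary statistics listed above can be computed. The slide data are not reproduced here, so as a stand-in this uses sets I and II of Anscombe's quartet (Anscombe, 1973), a classic example with the same moral: nearly identical summaries, very different scatterplots. The function name summarize and the dictionary keys are assumptions made for illustration.

```python
from statistics import mean, stdev

def summarize(xs, ys):
    """Return n, means, SDs, correlation, and the least-squares line of y on x."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    # Sample covariance (divides by n - 1, matching stdev)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    slope = cov / sx**2
    return {
        "n": n,
        "mean_x": mx, "sd_x": sx,
        "mean_y": my, "sd_y": sy,
        "cor": cov / (sx * sy),
        "intercept": my - slope * mx,
        "slope": slope,
    }

# Anscombe's quartet, sets I and II (stand-in data, NOT the datasets on the slide)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, ys in (("I", y1), ("II", y2)):
    stats = summarize(x, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The two printed rows agree to about two decimal places, yet set I is a noisy line and set II is a smooth curve. Only a plot shows the difference.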

Scatterplots are the simplest bivariate plots

Steps:

  1. Start with a set of paired numbers \((x_i, y_i)\) where \(i\) indexes the pairs, e.g. \((x_1, y_1)\) is the first pair, \((x_2, y_2)\) is the second pair, etc.

  2. Place the points on a Cartesian coordinate system. The labeling of the points reflects the convention that \(x_i\) goes on the x-axis and \(y_i\) goes on the y-axis
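
The two steps can be sketched without any plotting library by placing each pair on a character grid. This is only an illustration of the mapping from \((x_i, y_i)\) to a position. The function name text_scatter, the grid size, and the example pairs are all assumptions, not course material.

```python
def text_scatter(pairs, width=40, height=12):
    """Render (x, y) pairs as a character grid: x runs left to right, y bottom to top."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_min, y_min = min(xs), min(ys)
    # Guard against a zero range when all x (or all y) values are equal
    x_range = (max(xs) - x_min) or 1
    y_range = (max(ys) - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in pairs:
        col = round((x - x_min) / x_range * (width - 1))
        row = round((y - y_min) / y_range * (height - 1))
        # Row 0 of the grid is printed first, so flip y to put large values on top
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line) for line in grid)

# Step 1: a set of paired numbers (made-up values for illustration)
pairs = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 7)]
# Step 2: place them on the coordinate grid
print(text_scatter(pairs, width=20, height=8))
```

Swapping the order inside each pair transposes the picture, which is why the choice of which variable goes on which axis matters.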

Example: Figure 7 (Doll, 1955)

Lung-cancer deaths per million in 1950 (\(y\)) against annual per-capita cigarette consumption in 1930 (\(x\)) for 11 countries.

https://www.sciencedirect.com/science/article/pii/S0065230X08609173

Scatterplots imply a relationship

So don’t create a scatterplot if you don’t want to imply a relationship.

https://www.tylervigen.com/spurious-correlations

Dependent vs Independent Variables

Two main types of scatterplots:

  1. \(x\) and \(y\) are both uncontrolled. Goal is to show whether they co-vary

  2. \(x\) is a controlled or “independent” variable, e.g. time, age, dose, or an experimentally controlled variable. Goal is to show how \(y\), the “dependent” variable, changes with \(x\)

William Playfair (1759-1823)

  • Scottish engineer, economist, proto-government-spy, and many other things

When he wasn’t blackmailing lords and being sued for libel, William Playfair invented the pie chart, the bar graph, and the line graph

Cara Giaimo, 2016

https://www.atlasobscura.com/articles/the-scottish-scoundrel-who-changed-how-we-see-data

Case study: Playfair’s graph of prices, wages, and British monarchs

Never at any former time was wheat so cheap, in proportion to mechanical labor, as it is in the present time (Playfair)

Figure 3, https://onlinelibrary.wiley.com/doi/epdf/10.1002/jhbs.20078; Originally from Tufte, p34

Modernization Attempt 1

“Pure” scatterplot

Direct scatterplot of wheat price and wage, connected by consecutive years

Figure 5, https://onlinelibrary.wiley.com/doi/epdf/10.1002/jhbs.20078

Modernization Attempt 2

Create new variable

Now it is very easy to see Playfair’s claim about the price of wheat relative to wages. Statistical graphics should reveal data (Tufte, p13)

Figure 5, https://onlinelibrary.wiley.com/doi/epdf/10.1002/jhbs.20078
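
A minimal sketch of the "create new variable" step, assuming the new variable is the ratio of wheat price to wage, i.e. how much labor it takes to buy a fixed amount of wheat. The function name, the units in the comments, and the numbers in the usage lines are placeholders for illustration. They are not Playfair's data.

```python
def labor_cost_of_wheat(prices, wages):
    """Divide each wheat price by the wage in the same period.

    If prices are in shillings per quarter of wheat and wages are in
    shillings per week (assumed units), the result is weeks of work
    needed to buy one quarter of wheat.
    """
    if len(prices) != len(wages):
        raise ValueError("prices and wages must cover the same periods")
    return [price / wage for price, wage in zip(prices, wages)]

# Placeholder numbers, chosen only to make the arithmetic easy to follow
prices = [40.0, 30.0]
wages = [8.0, 10.0]
print(labor_cost_of_wheat(prices, wages))  # [5.0, 3.0]
```

Plotting this single derived series against time puts the comparison directly on the y-axis, so the reader does not have to divide two curves by eye.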

Graphical excellence

Reveal the data at several levels of detail

Aesthetics are quantitative mappings of data to visual properties:

  • x and y coordinates
  • size / area / volume
  • color
  • transparency
  • shape

Temperature anomalies

Land and ocean anomalies from 1850 to 2024 with respect to the 1901-2000 average

Separate data for northern and southern hemispheres
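
An anomaly "with respect to the 1901-2000 average" is each year's value minus the mean over that baseline period. NOAA publishes the series already in anomaly form, so the sketch below only shows the arithmetic. The function name, the dictionary layout, and the toy values are assumptions.

```python
from statistics import mean

def anomalies(temps_by_year, base_start=1901, base_end=2000):
    """Subtract the baseline-period mean from every year's temperature."""
    baseline = mean(
        temp for year, temp in temps_by_year.items()
        if base_start <= year <= base_end
    )
    return {year: temp - baseline for year, temp in temps_by_year.items()}

# Toy values (not NOAA data): baseline years 1 and 2 average to 11.0
toy = {1: 10.0, 2: 12.0, 3: 14.0}
print(anomalies(toy, base_start=1, base_end=2))  # {1: -1.0, 2: 1.0, 3: 3.0}
```

Because the baseline is subtracted from every year, zero on the y-axis means "equal to the baseline average", which is what gives positive and negative values their meaning in the plots.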

Northern hemisphere

Average temperature anomalies in the northern hemisphere over time

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series

Connected lines

Emphasis on interyear variability

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series

Smooth interpolation

Emphasis on trend

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series
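
The slide does not say which smoother was used. As a stand-in, a centered moving average shows the general idea: each point is replaced by the mean of its neighbors, which damps year-to-year variation and leaves the trend. The function name and default window are assumptions.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the two ends of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

print(moving_average([1, 2, 3, 4, 5], window=3))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

A wider window gives a smoother curve but hides more of the short-term variability, so the window is itself a design choice about what the graphic should emphasize.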

Bars

Emphasis on positive vs negative deviation

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series

Ribbon

Emphasis on positive vs negative deviation, also on time spent above or below

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series

Graphical excellence

Encourage comparison between data

Northern, Southern hemispheres

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series


Summary

[Still need]

References

Doll, R., 1955. Etiology of lung cancer. In Advances in cancer research (Vol. 3, pp. 1-50).

Friendly, M. and Denis, D., 2005. The early origins and development of the scatterplot. Journal of the History of the Behavioral Sciences, 41(2), pp.103-130.

Tufte, E.R., 2001. The visual display of quantitative information.